EXHIBIT A to Accompany Amendment and Response in U.S.
Serial No. 10/074,527 to Office Action Dated Nov. 1, 2004



Deciphering the Message in Protein Sequences: 
Tolerance to Amino Acid Substitutions 

James U. Bowie, John F. Reidhaar-Olson, Wendell A. Lim,
Robert T. Sauer



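An amino acid sequence encodes a message that determines the shape
and function of a protein. This message is highly degenerate in that
many different sequences can code for proteins with essentially the
same structure and activity. Comparison of different sequences with
similar messages can reveal key features of the code and improve
understanding of how a protein folds and how it performs its
function.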



The genome is manifest largely in the set of proteins that it
encodes. It is the ability of these proteins to fold into unique
three-dimensional structures that allows them to function and carry
out the instructions of the genome. Thus, comprehending the rules
that relate amino acid sequence to structure is fundamental to an
understanding of biological processes. Because an amino acid sequence
contains all of the information necessary to determine the structure
of a protein, it should be possible to predict structure from
sequence, and subsequently to infer detailed aspects of function from
the structure. However, both problems are extremely complex, and it
seems unlikely that either will be solved in an exact manner in the
near future. It may be possible to obtain approximate solutions by
using experimental data to simplify the problem. In this article, we
describe how an analysis of allowed amino acid substitutions in
proteins can be used to reduce the complexity of sequences and reveal
important aspects of structure and function.

Methods for Studying Tolerance to 
Sequence Variation 

There are two main approaches to studying the tolerance of an amino
acid sequence to change. The first method relies on the process of
evolution, in which mutations are either accepted or rejected by
natural selection. This method has been extremely powerful for
proteins such as the globins or cytochromes, for which sequences from
many different species are known (2-7). The second approach uses
genetic methods to introduce amino acid changes at specific positions
in a cloned gene and uses selections or screens to identify
functional sequences. This approach has been used to great advantage
for proteins that can be expressed in bacteria or yeast, where the
appropriate genetic manipulations are possible (3, 8-11). The end
results of both methods are lists of active sequences that can be
compared and analyzed to identify sequence features that are
essential for folding or function. If a particular property of a side
chain, such as charge or size, is important at a given position, only
side chains that have the required property will be allowed.
Conversely, if the chemical identity of the side chain is
unimportant, then many different substitutions will be permitted.

Studies in which these methods were used have revealed that proteins
are surprisingly tolerant of amino acid substitutions (2-4, 11). For
example, in studying the effects of approximately 1500 single amino
acid substitutions at 141 positions in lac repressor, Miller and
co-workers found that about one-half of all substitutions were
phenotypically silent (11). At some positions, many different
nonconservative substitutions were allowed. Such residue positions
play little or no role in structure and function. At other positions,
no substitutions or only conservative substitutions were allowed.
These residues are the most important for lac repressor activity.

What roles do invariant and conserved side chains play in proteins?
Residues that are directly involved in protein functions such as
binding or catalysis will certainly be among the most conserved. For
example, replacing the Asp in the catalytic triad of trypsin with Asn
results in a 10^4-fold reduction in activity (12). A similar loss of
activity occurs in λ repressor when a DNA-binding residue is changed
from Asn to Asp (13). To carry out their function, however, these
catalytic residues and binding residues must be precisely oriented in
three dimensions. Consequently, mutations in residues that are
required for structure formation or stability can also have dramatic
effects on activity (10, 14-16). Hence, many of the residues that are
conserved in sets of related sequences play structural roles.



The authors are in the Department of Biology, Massachusetts Institute of Technology,

Substitutions at Surface and Buried Positions

In their initial comparisons of the globin sequences, Perutz and
co-workers found that most buried residues require nonpolar side
chains, whereas few features of surface side chains are generally
conserved (6). Similar results have been seen for a number of protein
families (2, 4, 5, 7, 17, 18). An example of the sequence tolerance
at surface versus buried sites can be seen in Fig. 1, which shows the
allowed substitutions in λ repressor at residue positions that are
near the dimer interface but distant from the DNA-binding surface of
the protein (9). These substitutions were identified by a functional
selection after cassette mutagenesis. A histogram of side chain
solvent accessibility in the crystal structure of the dimer is also
shown in Fig. 1. At six positions, only the wild-type residue or
relatively conservative substitutions are allowed. Five of these
positions are buried in the protein. In contrast, most of the highly
exposed positions tolerate a wide range of chemically different side
chains, including hydrophilic and hydrophobic residues. Hence, it
seems that most of the structural information in this region of the
protein is carried by the residues that are solvent inaccessible.

Fig. 1. (A) Amino acid substitutions allowed in a short region of λ
repressor. The wild-type sequence is shown along the center line. The
allowed substitutions shown above each position were identified by
randomly mutating one to three codons at a time by using a cassette
method and applying a functional selection (9). (B) The fractional
solvent accessibility (42) of the wild-type side chain in the protein
dimer (43) relative to the same atoms in an Ala-X-Ala model
tripeptide.



Constraints on Core Sequences 

Because core residue positions appear to be extremely important for
protein folding or stability, we must understand the factors that
dictate whether a given core sequence will be acceptable. In general,
only hydrophobic or neutral residues are tolerated at buried sites in
proteins, undoubtedly because of the large favorable contribution of
the hydrophobic effect to protein stability (19). For example, Fig. 2
shows the results of genetic studies used to investigate the
substitutions allowed at residue positions that form the hydrophobic
core of the NH2-terminal domain of λ repressor (20). The acceptable
core sequences are composed almost exclusively of Ala, Cys, Thr, Val,
Ile, Leu, Met, and Phe. The acceptability of many different residues
at each core position presumably reflects the fact that the
hydrophobic effect, unlike hydrogen bonding, does not depend on
specific residue pairings. Although it is possible to imagine a
hypothetical core structure that is stabilized exclusively by
residues forming hydrogen bonds and salt bridges, such a core would
probably be difficult to construct because hydrogen bonds require
pairing of donors and acceptors in an exact geometry. Thus the
repertoire of possible structures that use a polar core would
probably be extremely limited (21). Polar and charged residues are
occasionally found in the cores of proteins, but only at positions
where their hydrogen bonding needs can be satisfied (22).

The cores of most proteins are quite closely packed (23), but some
volume changes are acceptable. In λ repressor, the overall core
volume of acceptable sequences can vary by about 10%. Changes at
individual sites, however, can be considerably larger. For example,
as shown in Fig. 2, both Phe and Ala are allowed at the same core
position in the appropriate sequence contexts. Large volume changes
at individual buried sites have also been observed in phylogenetic
studies, where it has been noted that the size decreases and
increases at interacting residues are not necessarily related in a
simple complementary fashion (5, 7, 17). Rather, local volume changes
are accommodated by conformational changes in nearby side chains and
by a variety of backbone movements.

The Informational Importance of the Core 

With occasional exceptions, the core must remain hydrophobic and
maintain a reasonable packing density. However, since the core is
composed of side chains that can assume only a limited number of
conformations (24), efficient packing must be maintained without
steric clashes. How important are hydrophobicity, volume, and steric
complementarity in determining whether a given sequence can form an
acceptable core? Each factor is essential in a physical sense, as a
stable core is probably unable to tolerate unsatisfied hydrogen
bonding groups, large holes, or steric overlaps (25). However, in an
informational sense, these factors are not equivalent. For example,
in experiments in which three core residues of λ repressor were
mutated simultaneously, volume was a relatively unimportant
informational constraint because three-quarters of all possible
combinations of the 20 naturally occurring amino acids had volumes
within the range tolerated in the core, and yet most of these
sequences were unacceptable (20). In contrast, of the sequences that
contained only the appropriate hydrophobic residues, a significant
fraction were acceptable. Hence, the hydrophobicity of a sequence
contains more information about its potential acceptability in the
core than does the total side chain volume. Steric compatibility was
intermediate between volume and hydrophobicity in informational
importance.

Fig. 2. Amino acid substitutions allowed in the core of λ repressor.
The wild-type side chains are shown pictorially in the approximate
orientation seen in the crystal structure (43). The lists of allowed
substitutions at each position are shown below the wild-type side
chains. These substitutions were identified by randomly mutating one
to four residues at a time by using a cassette method and applying a
functional selection (20). Not all substitutions are allowed in every
sequence background.



The Informational Importance of Surface Sites 

We have noted that many surface sites can tolerate a wide variety of
side chains, including hydrophilic and hydrophobic residues. This
result might be taken to indicate that surface positions contain
little structural information. However, Bashford et al., in an
extensive analysis of globin sequences, found a strong bias against
large hydrophobic residues at many surface positions. At one level,
this may reflect constraints imposed by protein solubility, because
large patches of hydrophobic surface residues would presumably lead
to aggregation. At a more fundamental level, protein folding requires
a partitioning between surface and buried positions. Consequently, to
achieve a unique native state without significant competition from
other conformations, it may be important that some sites have a
decided preference for exterior rather than interior positions. As a
result, many surface sites can accept hydrophobic residues
individually, but the surface as a whole can probably tolerate only a
moderate number of hydrophobic side chains.



Identification of Residue Roles from
Sets of Sequences

Often, a protein of interest is a member of a family of related
sequences. What can we infer from the pattern of allowed
substitutions at positions in sets of aligned sequences generated by
genetic or phylogenetic methods? Residue positions that can accept a
number of different side chains, including charged and highly polar
residues, are almost certain to be on the protein surface. Residue
positions that remain hydrophobic, whether variable or not, are
likely to be buried within the structure. In Fig. 3, those residue
positions in λ repressor that can accept hydrophilic side chains are
shown in orange and those that cannot accept hydrophilic side chains
are shown in green. The obligate hydrophobic positions define the
core of the structure, whereas positions that can accept hydrophilic
side chains define the surface.

Functionally important residues should be conserved in sets of active
sequences, but it is not possible to decide whether a side chain is
functionally or structurally important just because it is invariant
or conserved. To make this distinction requires an independent assay
of protein folding. The ability of a mutant protein to maintain a
stably folded structure can often be measured by biophysical
techniques, by susceptibility to intracellular proteolysis (26), or
by binding to antibodies specific for the native structure (27, 28).
In the latter cases, it is possible to screen proteins in mutated
clones for the ability to fold even if these proteins are inactive.
Sets of sequences that allow formation of a stable structure can then
be compared to the sets that allow both folding and function, with
the active site or binding residues being those that are variable in
the set of stable proteins but invariant in the set of functional
proteins. The DNA-binding residues of Arc repressor were identified
by this method (8). The receptor-binding residues of human growth
hormone were also identified by comparing the stabilities and
activities of a set of mutant sequences (28). However, in this case,
the mutants were generated as hybrid sequences between growth hormone
and related hormones with different binding specificities.

Fig. 3. Tolerance of positions in the NH2-terminal domain of λ
repressor to hydrophilic side chains. The complex (43) of the
repressor dimer (blue) and operator DNA (white) is shown. In (A),
positions that can tolerate hydrophilic side chains are shown in
orange. The same side chains are shown in (B) without the remaining
protein atoms. In (C), positions that require hydrophobic or neutral
side chains are shown in green. The same side chains are shown in (D)
without the remaining protein atoms. About three-fourths of the 92
side chains in the NH2-terminal domain are included in both (B) and
(D); the remaining positions have not been tested. Data are from (9,
14, 20, 27, 44).

Implications for Structure Prediction

At present, the only reliable method for predicting a low-resolution
tertiary structure of a new protein is by identifying sequence
similarity to a protein whose structure is already known (29, 30).
However, it is often difficult to align sequences as the level of
sequence similarity decreases, and it is sometimes impossible to
detect statistically significant sequence similarity between
distantly related proteins. Because the number of known sequences is
far greater than the number of known structures, it would be
advantageous to increase the reach of the available structural
information by improving methods for detecting distant sequence
relations and for subsequently aligning these sequences based on
structural principles. In a normal homology search, the sequence
database is scanned with a single test sequence, and every residue
must be weighted equally. However, some residues are more important
than others and should be weighted accordingly. Moreover, certain
regions of the protein are more likely to contain gaps than others.
Both kinds of information can be obtained from sequence sets, and
several techniques have been used to combine such information into
more appropriately weighted sequence searches and alignments (31).
These methods were used to align the sequences of retroviral
proteases with aspartic proteases, which in turn allowed construction
of a three-dimensional model for the protease of human
immunodeficiency virus type 1 (29). Comparison with the recently
determined crystal structure of this protein revealed reasonable
agreement in many areas of the predicted structure (32).
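
The idea that a sequence set can supply residue weights for a search
can be pictured with a small position-specific score. The sketch
below is not the procedure of (31); it is a minimal illustration
under stated assumptions: the alignment is ungapped, the function
names (allowed_sets, position_weights, score) and the toy sequences
are hypothetical, and the weight log2(20 / number of allowed
residues) is simply one convenient choice that gives restrictive
positions more influence than permissive ones.

```python
import math


def allowed_sets(alignment):
    """Return, for each column of an ungapped alignment, the set of residues seen."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [set(column) for column in zip(*alignment)]


def position_weights(allowed):
    """Assumed weighting: log2(20 / n_allowed), so a position that accepts
    all 20 residues gets weight 0 and an invariant position gets log2(20)."""
    return [math.log2(20 / len(residues)) for residues in allowed]


def score(test, allowed, weights):
    """Add the weight when the test residue is allowed, subtract it otherwise."""
    if len(test) != len(allowed):
        raise ValueError("test sequence length must match the alignment")
    return sum(w if res in residues else -w
               for res, residues, w in zip(test, allowed, weights))


# Toy alignment (hypothetical sequences, for illustration only).
family = ["LKEAL", "LRDAV", "LQEGI", "LSKAL"]
allowed = allowed_sets(family)
weights = position_weights(allowed)

# A mismatch at the invariant first position costs far more than a
# mismatch at the permissive second position.
print(score("LTEAL", allowed, weights))
print(score("DTEAL", allowed, weights))
```

In this toy case the two test sequences differ only at the invariant
first position, so their scores differ by twice that position's
weight, whereas an equally weighted comparison would treat that
mismatch like any other.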

The structural information at most surface sites is highly
degenerate. Except for functionally important residues, exterior
positions seem to be important chiefly in maintaining a reasonably
polar surface. The information contained in buried residues is also
degenerate, the main requirement being that these residues remain
hydrophobic. Thus, at its most basic level, the key structural
message in an amino acid sequence may reside in its specific pattern
of hydrophobic and hydrophilic residues. This is meant in an
informational sense. Clearly, the precise structure and stability of
a protein depends on a large number of detailed interactions. It is
possible, however, that structural prediction at a more primitive
level can be accomplished by concentrating on the most basic
informational aspects of an amino acid sequence. For example,
amphipathic patterns can be extracted from aligned sets of sequences
and used, in some cases, to identify secondary structures.
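
The rule of thumb used for Fig. 3 (positions that accept hydrophilic
side chains are probably exposed, positions that stay hydrophobic or
neutral are probably buried) can be sketched in a few lines. This is
an illustrative simplification, not the analysis used to produce the
figure: the function name classify_positions and the toy alignment
are hypothetical, the hydrophilic set is taken to be the
low-hydrophobicity group listed in the Fig. 4 legend, and the legend's
special handling of Arg and Lys is ignored.

```python
# One-letter codes for Gln, Asn, Glu, Asp, Lys, and Arg
# (the low-hydrophobicity group in the Fig. 4 legend).
HYDROPHILIC = set("QNEDKR")


def classify_positions(alignment):
    """Label each alignment column 'surface' if any hydrophilic residue
    is tolerated there, and 'buried' otherwise (a candidate core site)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return ["surface" if HYDROPHILIC & set(column) else "buried"
            for column in zip(*alignment)]


# Toy set of allowed sequences (hypothetical, for illustration only).
family = ["LKEAL", "LRDAV", "LQEGI", "LSKAL"]
print(classify_positions(family))
# ['buried', 'surface', 'surface', 'buried', 'buried']
```

A "buried" label here only means that no hydrophilic residue has been
observed; with few sequences, an exposed position can be mislabeled
simply because it has not yet been sampled.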

If a region of secondary structure is packed against the hydrophobic
core, a pattern of hydrophobic residues reflecting the periodicity of
the secondary structure is expected (33, 34). These patterns can be
obscured in individual sequences by hydrophobic residues on the
protein surface. It is rare, however, for a surface position to
remain hydrophobic over the course of evolution. Consequently, the
amphipathic patterns expected for simple secondary structures can be
much clearer in a set of related sequences (6). This principle is
illustrated in Fig. 4, which shows helical hydrophobic moment plots
for the Antennapedia homeodomain sequence (Fig. 4A) and for a
composite sequence derived from a set of homologous homeodomain
proteins (Fig. 4B) (35). The hydrophobic moment is a simple measure
of the degree of amphipathic character of a sequence in a given
secondary structure (34). The amphipathic character of the three
α-helical regions in the Antennapedia protein (36) is clearly
revealed only by the analysis of the combined set of homeodomain
sequences. The secondary structure of Arc repressor, a small
DNA-binding protein, was recently predicted by a similar method (8)
and confirmed by nuclear magnetic resonance studies (37).
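
The helical hydrophobic moment can be computed directly from the
parameters given in the Fig. 4 legend: an eight-residue window,
successive side chains projected radially every 100°, and vector
magnitudes of 1, 0, or -1 for the three hydrophobicity groups. The
sketch below follows those parameters for a single sequence; the
function names are hypothetical, and the legend's procedure for
building a composite sequence from an alignment is not reproduced.

```python
import math

# Three hydrophobicity groups from the Fig. 4 legend (one-letter codes).
H1 = set("WIFLMVC")  # high hydrophobicity   -> +1
H2 = set("YPATHGS")  # medium hydrophobicity ->  0
H3 = set("QNEDKR")   # low hydrophobicity    -> -1


def group_value(residue):
    """Map a residue to its vector magnitude (1, 0, or -1)."""
    if residue in H1:
        return 1
    if residue in H2:
        return 0
    if residue in H3:
        return -1
    raise ValueError(f"unknown residue: {residue!r}")


def hydrophobic_moment(window, angle_deg=100.0):
    """Length of the vector sum of side-chain hydrophobicities, with
    residue i pointing at i * angle_deg around the helix axis."""
    x = y = 0.0
    for i, residue in enumerate(window):
        theta = math.radians(i * angle_deg)
        h = group_value(residue)
        x += h * math.cos(theta)
        y += h * math.sin(theta)
    return math.hypot(x, y)


def moment_profile(sequence, width=8):
    """Hydrophobic moment for every window of the given width."""
    return [hydrophobic_moment(sequence[i:i + width])
            for i in range(len(sequence) - width + 1)]


# A helix-like repeat scores higher than a strictly alternating pattern.
print(hydrophobic_moment("LKKLLKKL") > hydrophobic_moment("LKLKLKLK"))  # True
```

With a 100° rotation per residue, hydrophobic residues spaced three
and four positions apart fall on the same side of the helix and
reinforce each other, whereas an alternating pattern largely cancels.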

The specific pattern of hydrophobic and hydrophilic residues in an
amino acid sequence must limit the number of different structures a
given sequence can adopt and may indeed define its overall fold. If
this is true, then the arrangement of hydrophobic and hydrophilic
residues should be a characteristic feature of a particular fold.
Sweet and Eisenberg have shown that the correlation of the pattern of
hydrophobicity between two protein sequences is a good criterion for
their structural relatedness (38). In addition, several studies
indicate that patterns of obligatory hydrophobic positions identified
from aligned sequences are distinctive features of sequences that
adopt the same structure (4, 29, 38, 39). Thus, the order of
hydrophobic and hydrophilic residues in a sequence may actually be
sufficient information to determine the basic folding pattern of a
protein sequence.
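
A correlation between hydrophobicity patterns is easy to illustrate.
The sketch below is not the scale or procedure of (38); it assumes
two pre-aligned, equal-length sequences, borrows the coarse
three-group values (+1, 0, -1) from the Fig. 4 legend as a stand-in
hydrophobicity scale, and uses hypothetical function names and toy
sequences.

```python
import math

# Coarse hydrophobicity values borrowed from the Fig. 4 legend groups.
GROUP_VALUE = {
    **{r: 1 for r in "WIFLMVC"},
    **{r: 0 for r in "YPATHGS"},
    **{r: -1 for r in "QNEDKR"},
}


def profile(sequence):
    """Convert a sequence into its list of hydrophobicity values."""
    return [GROUP_VALUE[residue] for residue in sequence]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_a * var_b)


# Two toy sequences with no identical residues but the same pattern.
print(pearson(profile("LKEAL"), profile("IRDGV")))
```

The two toy sequences share no residue at any position, yet their
hydrophobicity profiles are identical, which is the situation in
which a pattern-based comparison can detect a relationship that
residue identity misses.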

Although the pattern of sequence hydrophobicity may be a
characteristic feature of a particular fold, it is not yet clear how
such patterns could be used for prediction of structure de novo. It
is important to understand how patterns in sequence space can be
related to structures in conformation space. Lau and Dill have
approached this problem by studying the properties of simple
sequences composed only of H (hydrophobic) and P (polar) groups on
two-dimensional lattices (40). An example of such a representation is
shown in Fig. 5. Residues adjacent in the sequence must occupy
adjacent squares on the lattice, and two residues cannot occupy the
same space. Free energies of particular conformations are evaluated
with a single term, an attraction of H groups. By considering chains
of ten residues, an exhaustive conformational search for all 1024
possible sequences of H and P residues was possible. For longer
sequences only a representative fraction of the allowed sequence or
conformation space could be explored. The significant results were as
follows: (i) not all sequences can fold into a "native" structure and
only a few sequences form a unique native structure; (ii) the
probability that a sequence will adopt a unique native structure
increases with chain length; and (iii) the native states are compact,
contain a hydrophobic core surrounded by polar residues, and contain
significant secondary structure. Although the gap between these
two-dimensional simulations and three-dimensional structures is
large, the use of simple rules and sequence representations yields
results similar to those expected for real proteins.
Three-dimensional lattice methods are also beginning to be developed
and evaluated (41).
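
The lattice rules described above are simple enough to enumerate by
brute force for short chains. The following sketch is an independent
illustration, not the code of (40): it generates every self-avoiding
walk on a square lattice, counts H-H contacts between residues that
are lattice neighbors but not sequence neighbors, and reports the
best contact count (the lowest energy, if each contact contributes
one unit of attraction) and how many conformations reach it. The
function names are hypothetical. The first bond is fixed to remove
rotations, but mirror images are counted separately.

```python
def conformations(n):
    """Yield all self-avoiding walks of n residues on the square lattice.
    The first bond is fixed along +x so rotations are not repeated."""
    if n == 1:
        yield ((0, 0),)
        return
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))
    path = [(0, 0), (1, 0)]
    occupied = set(path)

    def extend():
        if len(path) == n:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in moves:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:  # two residues cannot share a square
                path.append(nxt)
                occupied.add(nxt)
                yield from extend()
                path.pop()
                occupied.remove(nxt)

    yield from extend()


def hh_contacts(sequence, path):
    """Count H-H pairs on adjacent squares that are not bonded neighbors."""
    index = {pos: i for i, pos in enumerate(path)}
    count = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "H":
            continue
        # Look only right and up so each pair is counted once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = index.get(neighbor)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                count += 1
    return count


def best_contacts(sequence):
    """Return (maximum H-H contacts, number of conformations achieving it)."""
    best, ways = -1, 0
    for path in conformations(len(sequence)):
        c = hh_contacts(sequence, path)
        if c > best:
            best, ways = c, 1
        elif c == best:
            ways += 1
    return best, ways


print(best_contacts("HPPH"))  # (1, 2): a U-shaped turn and its mirror image
```

The same functions can be applied to ten-residue chains, although
looping over all 1024 sequences in pure Python takes noticeably
longer because each chain has many thousands of conformations.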



Summary 

There is more information in a set of related sequences than in a
single sequence. A number of practical applications arise from an
analysis of the tolerance of residue positions to change. First, such
information permits the evaluation of a residue's importance to the
function and stability of a protein. This ability to identify the
essential elements of a protein sequence may improve our
understanding of the determinants of protein folding and stability as
well as protein function. Second, patterns of tolerance to amino acid
substitutions of varying hydrophilicity can help to identify residues
likely to be buried in a protein structure and those likely to occupy
surface positions. The amphipathic patterns that emerge can be used
to identify probable regions of secondary structure. Third,
incorporating a knowledge of allowed substitutions can improve the
ability to detect and align distantly related proteins because the
essential residues can be given prominence in the alignment scoring.

As more sequences are determined, it becomes increasingly likely that
a protein of interest is a member of a family of related sequences.
If this is not the case, it is now possible to use genetic methods to
generate lists of allowed amino acid substitutions. Consequently, at
least in the short term, it may not be necessary to solve the folding
problem for individual protein sequences. Instead, information from
sequence sets could be used. Perhaps by simplifying sequence space
through the identification of key residues, and by simplifying
conformation space as in the lattice methods, it will be possible to
develop algorithms to generate a limited number of trial structures.
These trial structures could then, in turn, be evaluated by further
experiments and more sophisticated energy calculations.

Fig. 4. Helical hydrophobic moments calculated by using (A) the
Antennapedia homeodomain sequence or (B) a set of 39 aligned
homeodomain sequences (35). The bars indicate the extents of the
helical regions identified in nuclear magnetic resonance studies of
the Antennapedia protein (36). To determine hydrophobic moments,
residues were assigned to one of three groups: H1 (high
hydrophobicity = Trp, Ile, Phe, Leu, Met, Val, or Cys); H2 (medium
hydrophobicity = Tyr, Pro, Ala, Thr, His, Gly, or Ser); and H3 (low
hydrophobicity = Gln, Asn, Glu, Asp, Lys, or Arg). For the aligned
homeodomain sequences, the residues at each position were sorted by
their hydrophobicity by using the scale of Fauchère and Pliska (45).
Arg and Lys were not counted unless no other residue was found at the
position, because they contain long aliphatic side chains and can
thereby substitute for nonpolar residues at some buried sites. To
account for possible sequence errors and rare exceptions, the most
hydrophilic residue allowed at each position was discarded unless it
was observed twice. The second most hydrophilic residue was then
chosen to represent the hydrophobicity of each position. An
eight-residue window was used and the vectors projected radially
every 100°. The vector magnitudes were assigned a value of 1, 0, or
-1 for positions where the hydrophobicity group was H1, H2, or H3,
respectively.

Fig. 5. A representation of the compact conformation for a particular
sequence of H and P residues on a square lattice. [Adapted from (40),
with permission of the American Chemical Society]